This analysis explores a dataset of roughly 1.5 million beer reviews from BeerAdvocate. In addition, the Open Brewery DB database/API is used for more detailed brewery data.
A few notes about the dataset:
`weighted_review`: engineered variable that calculates a weighted average of all 5 review categories. This is the primary review score listed on BeerAdvocate's website. BeerAdvocate defines the weights as 20%, 24%, 6%, 10%, and 40%, respectively, for the following review categories:

- `review_overall`: overall impression of the beer.
- `review_aroma`: any malt, hops, yeast, and other aromatics present in the beer.
- `review_appearance`: the beer's color, clarity, head retention, and lacing.
- `review_palate`: the beer's body, carbonation, warmth, creaminess, astringency, and other palate sensations.
- `review_taste`: any malt, hops, fermentation byproducts, balance, finish or aftertaste, and other flavor characteristics.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
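Before loading the data, here's a quick sanity check of the weighting scheme described above. The review values below are hypothetical, not rows from the dataset:

```python
# BeerAdvocate category weights as described above
weights = {'review_overall': 0.20, 'review_aroma': 0.24,
           'review_appearance': 0.06, 'review_palate': 0.10,
           'review_taste': 0.40}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # the weights cover 100%

# a hypothetical single review; taste carries the most influence
review = {'review_overall': 4.0, 'review_aroma': 4.5,
          'review_appearance': 3.5, 'review_palate': 4.0,
          'review_taste': 4.5}
weighted = sum(review[k] * w for k, w in weights.items())
print(round(weighted, 2))
```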
beer_df = pd.read_csv('beer_reviews.csv')
beer_df
beer_df.info()
Let's clean up our dataframe a bit, by removing some unwanted columns and dealing with missing values.
# remove unwanted columns, and rearrange some columns.
beer_df_clean = beer_df.drop(columns=['review_time', 'review_profilename'])
beer_df_clean = beer_df_clean[['brewery_id', 'brewery_name', 'beer_beerid', 'beer_name', 'beer_style', 'beer_abv',
'review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']]
beer_df_clean
# deal with missing values in the beer_abv column.
beer_df_clean.beer_abv.isna().sum()
beer_df_clean[beer_df_clean['beer_abv'].isna()]
# drop these rows.
beer_df_clean.dropna(subset=['beer_abv'], inplace=True)
beer_df_clean.info()
We also have a few missing values in the brewery_name column. Let's go ahead and drop those as well.
beer_df_clean.dropna(subset=['brewery_name'], inplace=True)
beer_df_clean.info()
beer_df_clean
That looks better.
Now that we have our df cleaned up a bit, I am interested in trying to get some more data for breweries, such as location and type of brewery.
My idea is to see if there is a city or state that has a high correlation with overall rating. Then I want to try to make a geospatial visualization of which states or cities receive the highest ratings.
I will be using a python wrapper for the Open Brewery DB API, called openbrewery, to get this data.
# import openbrewerydb as ob
# brewery_df = ob.load()
# brewery_df.to_csv('openbrewerydb.csv', index=False)
# brewery_df
Since the above import statement won't work unless you install the wrapper, I saved the data to a CSV in this project folder.
brewery_df = pd.read_csv('openbrewerydb.csv')
brewery_df
# rename 'name' column.
brewery_df.rename(columns = {'name':'brewery_name'}, inplace=True)
# number of brewery name matches between our two dataframes.
a = beer_df_clean['brewery_name'].unique().tolist()
b = brewery_df['brewery_name'].unique().tolist()
matches = list(set(a).intersection(b))
len(matches)
Ok so there are quite a few breweries that didn't match between the two dataframes. However, there is a good chance that there are more breweries that do match but there is some small difference in the name, for example 'Co.' instead of 'Company'.
My first idea was to loop through the `beer_df_clean.brewery_name` column, split off the first word of each brewery name, and then search for breweries starting with that word in `brewery_df.brewery_name`. Unfortunately this plan didn't work out. Therefore I will do some manual searching of similar brewery names, with as much programmatic help as possible.
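One programmatic alternative (a sketch, not the approach actually taken here) is `difflib.get_close_matches` from the standard library, which can shortlist near-matches like 'Co.' vs 'Company' for manual confirmation. The brewery names below are hypothetical:

```python
from difflib import get_close_matches

# hypothetical near-miss brewery names (not actual dataset rows)
beer_names = ["Stone Brewing Co.", "Bell's Brewery, Inc."]
brewery_names = ["Stone Brewing Company", "Bells Brewery", "Odd Duck Brewing"]

# shortlist the single closest candidate for each unmatched name
suggestions = {}
for name in beer_names:
    close = get_close_matches(name, brewery_names, n=1, cutoff=0.6)
    if close:
        suggestions[name] = close[0]
print(suggestions)
```

The `cutoff` threshold controls how dissimilar a candidate can be; a human would still need to confirm each suggested pair before renaming.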
# first we need a list of breweries that did not initially match from the beer dataframe.
brewery_temp = beer_df_clean[~beer_df_clean.brewery_name.isin(matches)]
brewery_temp
# same thing from the brewery dataframe.
brewery_temp2 = brewery_df[~brewery_df.brewery_name.isin(matches)]
brewery_temp2
These two dataframes provide us with all of the breweries that did not initially match. The first dataframe, `brewery_temp`, contains breweries from our main dataframe `beer_df_clean`. The second dataframe, `brewery_temp2`, contains breweries from `brewery_df`, which holds the additional brewery data we want in the main dataframe.

Now I can create lists of brewery names from each dataframe, export them as CSVs, sort them, and open the full lists in Notepad++ to compare them side by side. I will find the breweries that are present in both lists but have a small difference in the name that prevented an exact programmatic match. I will then save the matches I find as two new lists, and use them to build a dictionary for renaming the breweries in `brewery_df` so that they match the brewery names in `beer_df_clean`. Finally, I can rerun `list(set(a).intersection(b))` and hopefully have more than 268 matches.
# create lists of these breweries
c = brewery_temp.brewery_name.unique().tolist()
d = brewery_temp2.brewery_name.unique().tolist()
# preview of our list. This is a list of every brewery that did not initially match.
sorted(c)
df_names1 = pd.DataFrame(sorted(c), columns=['names_1'])
df_names1.to_csv('names1.csv', index=False)
sorted(d)
df_names2 = pd.DataFrame(sorted(d), columns=['names_2'])
df_names2.to_csv('names2.csv', index=False)
Ok, so now I have two csv files with unique brewery names that I can compare in Notepad++.
Once I have done this, I can read in new csv files that list the matches I found.
df_names1_analyzed = pd.read_csv('names1_analyzed.csv')
df_names1_analyzed
df_names2_analyzed = pd.read_csv('names2_analyzed.csv')
df_names2_analyzed
Sweet, so I was able to find 285 additional breweries. This process was a lot less painful than I imagined, but I am sure that there is a better way to go about doing this.
If my reviewer has any suggestions about an alternative approach I could have taken, that would be awesome!
# create dictionary from these brewery names.
names_dict = dict(zip(df_names2_analyzed['names_2'], df_names1_analyzed['names_1']))
names_dict
Now we can use `names_dict` to rename the brewery names in `brewery_df`.
brewery_df['brewery_name'] = brewery_df['brewery_name'].replace(names_dict)
# did this work?
brewery_df.query("brewery_name == '(512) Brewing Company'")
Looks like it worked. Now we can rerun the `matches` code.
x = beer_df_clean['brewery_name'].unique().tolist()
y = brewery_df['brewery_name'].unique().tolist()
matches = list(set(x).intersection(y))
len(matches)
Awesome, we were able to match all 285 additional breweries: 268 (initial matches) + 285 (additional matches) = 553 total matches.
Now we can merge these matched breweries into `beer_df_clean`, so that we have location data for each brewery in our main dataframe. This will allow us to make the geospatial viz.
beer_df_clean = beer_df_clean.merge(brewery_df, on='brewery_name')
beer_df_clean
# number of unique beers with ratings
len(beer_df_clean.beer_name.unique())
So, I've decided to merge `beer_df_clean` and `brewery_df` using an inner join. This drops the number of rows from 1,518,814 down to 563,966.

For the purposes of this analysis, this is more than enough data, especially since there are over 9,000 unique beers with ratings and over 550 unique breweries with location data.
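The effect of an inner join can be previewed before committing to it. A minimal sketch with toy stand-ins for the two dataframes, using pandas' `indicator` parameter:

```python
import pandas as pd

# toy stand-ins for beer_df_clean and brewery_df (hypothetical values)
beers = pd.DataFrame({'brewery_name': ['A', 'A', 'B', 'C'],
                      'review_overall': [4.0, 3.5, 4.5, 5.0]})
breweries = pd.DataFrame({'brewery_name': ['A', 'B'],
                          'state': ['California', 'Oregon']})

# an outer merge with indicator=True tags where each row came from,
# previewing how many rows an inner join would keep ('both')
check = beers.merge(breweries, on='brewery_name', how='outer', indicator=True)
print(check['_merge'].value_counts())

# merge() defaults to an inner join: only breweries present in both survive
merged = beers.merge(breweries, on='brewery_name')
print(len(merged))  # brewery 'C' is dropped
```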
Let's make some final touches to finish cleaning `beer_df_clean`.
beer_df_clean.columns
beer_df_clean.drop(columns=['id', 'street', 'postal_code', 'country', 'phone',
'website_url', 'updated_at'], inplace=True)
# rearrange column headers.
beer_df_clean = beer_df_clean[['beer_beerid', 'beer_name', 'beer_style', 'beer_abv', 'review_overall',
'review_aroma', 'review_appearance', 'review_palate', 'review_taste',
'brewery_id', 'brewery_name', 'brewery_type', 'city', 'state',
'longitude', 'latitude']]
beer_df_clean
# We don't need two unique id's.
beer_df_clean.drop(columns=['brewery_id'], inplace=True)
# just noticed this odd column name.
beer_df_clean.rename(columns={'beer_beerid':'beer_id'}, inplace=True)
# double check for any NaN values.
beer_df_clean.isna().sum()
169 rows have NaN values for city and state. Let's drop these rows.
But let's not worry about the longitude and latitude NaN values, because I am not sure if I will be using these columns or not yet.
beer_df_clean.dropna(subset=['city', 'state'], inplace=True)
beer_df_clean.isna().sum()
beer_df_clean
Dataframe looks good.
Last thing I want to do is some feature engineering to create the variable `weighted_review`, which will represent a weighted average across all 5 review categories: `review_overall`, `review_aroma`, `review_appearance`, `review_palate`, and `review_taste`. Refer to Preliminary Wrangling for the weights used and category definitions.

This column will act as the true review rating given by each reviewer per beer. This is the same weighted average that BeerAdvocate uses for each user-generated review on their website.
def weighted_review(x):
return x['review_overall']*0.2 + x['review_aroma']*0.24 + x['review_appearance']*0.06 + x['review_palate']*0.1 + x['review_taste']*0.4
beer_df_clean['weighted_review'] = beer_df_clean.apply(weighted_review, axis=1)
beer_df_clean
# rearrange columns
beer_df_clean = beer_df_clean[['beer_id', 'beer_name', 'beer_style', 'beer_abv', 'weighted_review', 'review_overall',
'review_aroma', 'review_appearance', 'review_palate', 'review_taste', 'brewery_name',
'brewery_type', 'city', 'state','longitude', 'latitude']]
beer_df_clean
- After a bit of cleaning, our beer dataset has 563,797 rows across 16 columns.
- The dataset includes 16 features, with a mixture of quantitative and qualitative features.
- The values for each feature are the correct data type for our analysis. All numeric values are either floats or integers, and all categorical values are objects/strings.
- A final note to make is that there are a number of beers that have multiple reviews. Therefore we will keep this in mind when doing bivariate and multivariate analysis. For example, if we want to create a bar chart ranking the top 10 highest rated beers, we will need to aggregate the ratings for each beer before plotting.
- Additional brewery data was also merged into the main dataframe `beer_df_clean` from the Open Brewery DB. The variables we pulled from this database included `brewery_type`, `city`, `state`, `longitude`, and `latitude`.
- Lastly, we have one engineered feature, `weighted_review`, which is a weighted average of the 5 review categories that were in the original dataframe `beer_df`.
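The note above about beers with multiple reviews can be sketched with toy data (hypothetical beer names and scores): aggregate to one mean rating per beer before ranking.

```python
import pandas as pd

# toy reviews: 'Hop Bomb' is reviewed three times, 'Malt Star' once
reviews = pd.DataFrame({
    'beer_name': ['Hop Bomb', 'Hop Bomb', 'Hop Bomb', 'Malt Star'],
    'weighted_review': [4.0, 4.5, 4.1, 4.4],
})

# collapse repeat reviews into one mean rating per beer, then rank
ranked = (reviews.groupby('beer_name')['weighted_review']
          .mean()
          .sort_values(ascending=False))
print(ranked)
```

Without the groupby, 'Hop Bomb' would be counted three times and dominate any count-based chart.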
The aim of the analysis of this dataset is to find out which features have the highest correlation with the overall review score of beers. I am most curious to see if there is a specific brewery, brewery location, or beer style that receives higher-than-average ratings. I am also curious to see how the alcohol content of a beer is correlated with its overall rating. Finally, I am interested to see which review sub-category (aroma, appearance, palate, or taste) is most highly correlated with overall rating.
I expect the sub-category ratings to be most highly correlated with the overall rating of each beer; however, there's really no saying which one will have the highest correlation. I do believe that their correlations will be relatively similar, and no one category will stand out. I am most interested in seeing what insights we can find in the categorical features: brewery, state, brewery type, and beer style.
First we'll do some exploration of the quantitative variables in the dataset.
We'll take a look at: `beer_abv`, `review_overall`, `review_aroma`, `review_appearance`, `review_palate`, and `review_taste`.
beer_df_clean
# beer_abv distribution
sns.histplot(data=beer_df_clean, x='beer_abv');
Alright, so it looks like we have some outliers for `beer_abv`. Let's look into these.
abv_stats = beer_df_clean.beer_abv.describe()
abv_stats
I am going to define outliers using the 1.5 x IQR rule.
# Find limits for the 1.5xIQR rule.
IQR = abv_stats['75%'] - abv_stats['25%']
lower_limit = abv_stats['25%'] - (1.5 * IQR)
upper_limit = abv_stats['75%'] + (1.5 * IQR)
print(f'Outlier Lower Limit = {lower_limit} \n\
Outlier Upper Limit = {upper_limit}')
Now that we have our lower and upper limits to define the outliers, we will use these values to filter our data and then re-plot the distribution.
beer_df_clean.query("beer_abv < 0.25")
beer_df_clean.query("beer_style == 'Low Alcohol Beer'").count()['beer_abv']
beer_df_clean.query("beer_style == 'Low Alcohol Beer'")['brewery_name'].unique()
Alright, so this is interesting. There are only two data points that lie below the lower limit of 0.25, and both have `beer_style == 'Low Alcohol Beer'`. There are also 277 other data points with this beer style, all brewed by the same brewery, Anheuser-Busch.

Since low-alcohol beer has a very specific target market, it doesn't really fit the purposes of this analysis. The purpose of this analysis is to gain insight into different beers for people who enjoy drinking beer without any worry about its alcohol content; in other words, alcohol content isn't a deciding factor in choosing a specific beer. So it doesn't make sense to include beers aimed at people for whom alcohol content is a deciding factor.
Having said that, I will drop all `Low Alcohol Beer` data from the dataset.
beer_df_clean = beer_df_clean.query("beer_style != 'Low Alcohol Beer'")
beer_df_clean
# check if rows were dropped.
beer_df_clean.query("beer_style == 'Low Alcohol Beer'")
Ok, so we've taken care of the outliers below the lower limit; let's take a look at the outliers above the upper limit.
beer_df_clean.query("beer_abv > 14.25")
Ok, so over 10,000 data points can be defined as an "outlier". However, we really shouldn't drop these values just because they are outliers. So let's look into some of these data points in more detail.
#how many data points do we have for each abv value > 14.25?
beer_df_clean.query("beer_abv > 14.25").beer_abv.value_counts().sort_index()
It doesn't look like there are any values that are completely outlandish. There is a fairly consistent progression of values.
Let's take a look at the highest abv value, 41, and make sure that it is actually a real beer.
beer_df_clean.query("beer_abv == 41")
It looks like Sink The Bismark! is a real beer, and it was the strongest beer in the world when it was released.
Since there isn't a good reason to drop these values, other than that they are outliers, I think the best course of action is to keep them in our dataset and just change the `binrange` for the distribution.

In addition, when we get to bivariate and multivariate analysis, it might be a good idea to show results with and without these outliers.
The last thing we need to do before re-plotting is take into account beers with multiple reviews. This is important because beers with multiple reviews have the same `beer_abv` value and are being counted multiple times, which could skew the data.

Therefore, we need to filter out the duplicate reviews and plot a single `beer_abv` value for each beer. Let's do that now.
abv_unique = beer_df_clean.drop_duplicates(subset=['beer_name', 'beer_style', 'beer_abv', 'brewery_name', 'city', 'state'])
abv_unique
This dataframe, `abv_unique`, now counts a single `beer_abv` value for each unique beer in `beer_df_clean`. We can go ahead and use `abv_unique` to plot the distribution.
# re-plot distribution for beer_abv.
sns.histplot(data=abv_unique, x='beer_abv', binwidth=0.5, binrange=(2,14));
Here, I set the `binrange` to the outlier limits we calculated before. Smaller bin sizes didn't give any more useful insight. The distribution has a slight right skew, with most data points in the 4 to 10 abv range.

Next, let's deal with the right skew of this distribution.
First, I want to take a look at a few different transformations and compare their skew. This will give us a general sense of which transformation is best to use.
# skew of original data.
abv_unique.beer_abv.skew()
This is the skew of the `beer_abv` column in `abv_unique`. Let's apply some transformations and see if we can get that skew value closer to 0. Since we are dealing with a right skew, we will compare root and log transformations.
print(f'''square root transformation: {np.sqrt(abv_unique.beer_abv).skew()}
cube root transformation: {np.cbrt(abv_unique.beer_abv).skew()}
log transformation: {np.log10(abv_unique.beer_abv).skew()}
''')
Log transformation looks like the way to go here.
log_binsize = 0.035
bins = 10 ** np.arange(0.3, 1.3 + log_binsize, log_binsize)
sns.histplot(data=abv_unique, x='beer_abv', bins=bins)
plt.xscale('log')
tick_locs = [2, 5, 10, 20]
plt.xticks(tick_locs, tick_locs);
This looks a bit better now.
Now let's take a look at our review scores: `weighted_review`, `review_overall`, `review_aroma`, `review_appearance`, `review_palate`, and `review_taste`.

First I want to look at some general details about these variables before plotting. We probably should have done this in the first section, but oh well.
beer_df_clean[['weighted_review', 'review_overall', 'review_aroma',
'review_appearance', 'review_palate', 'review_taste']].describe()
So it looks like the variables `review_overall` and `review_appearance` have a minimum review score of 0. This must be an input error, since the lowest review score you can give on BeerAdvocate is 1. Therefore, let's drop any rows that have review scores of 0.
review_zero = beer_df_clean.query('review_overall == 0')
review_zero
beer_df_clean.drop(index=review_zero.index, inplace=True)
# check that rows were dropped.
beer_df_clean.query('review_overall == 0')
Now with that taken care of, let's start by plotting the distribution for `weighted_review`, which again is a weighted average of the 5 review categories: `review_overall`, `review_aroma`, `review_appearance`, `review_palate`, and `review_taste`.
sns.histplot(data=beer_df_clean, x='weighted_review', binwidth=0.1)
plt.title('Weighted Average Review Score Distribution')
plt.xlabel('Review score');
Unimodal distribution with a left skew, sweet. Let's move onto the other 5 review categories.
fig, (ax1, ax2, ax3) = plt.subplots(3,2, figsize=(12,15))
sns.histplot(data=beer_df_clean, x='review_overall', ax=ax1[0], binwidth=0.5)
ax1[0].set_title('Overall Review Score Distribution')
ax1[0].set_xlabel('Review score')
sns.histplot(data=beer_df_clean, x='review_aroma', ax=ax1[1], binwidth=0.5)
ax1[1].set_title('Aroma Review Score Distribution')
ax1[1].set_xlabel('Review score')
sns.histplot(data=beer_df_clean, x='review_appearance', ax=ax2[0], binwidth=0.5)
ax2[0].set_title('Appearance Review Score Distribution')
ax2[0].set_xlabel('Review score')
sns.histplot(data=beer_df_clean, x='review_palate', ax=ax2[1], binwidth=0.5)
ax2[1].set_title('Palate Review Score Distribution')
ax2[1].set_xlabel('Review score')
sns.histplot(data=beer_df_clean, x='review_taste', ax=ax3[0], binwidth=0.5)
ax3[0].set_title('Taste Review Score Distribution')
ax3[0].set_xlabel('Review score')
ax3[1].axis('off')
plt.tight_layout()
The distributions for these sub-category review scores have a shape and skew similar to the weighted review score.
This tells us a couple of things. One, these sub-categories will likely have a strong correlation with each other. Two, we will most likely want to apply the same transformation to each of these distributions. A squared or cubed transformation would make sense since we are dealing with a left skew.
We will most likely have to use matplotlib to create our own scale function, since it doesn't have a squared or cubed scale built in.
Another thing to note here is the bin size change. By chance, every review score in the dataset was given in an interval of 0.5 (i.e. 0.5, 1.0, 1.5, etc.). However, on BeerAdvocate a reviewer can give any score between 1 and 5, including decimals. Because of this, I have been treating these review variables as quantitative, even though there is an argument that they are qualitative.
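That half-point observation can be verified with a quick check. A sketch on hypothetical scores; in the real dataframe the same test would run on each review column:

```python
import numpy as np

# hypothetical review scores standing in for a review_* column
scores = np.array([1.0, 2.5, 3.5, 4.0, 4.5, 5.0])

# a score lies on a half-point step iff doubling it gives a whole number
on_half_steps = bool(np.all((scores * 2) % 1 == 0))
print(on_half_steps)
```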
Next, let's compare the skew of squared and cubed transformations for each review category, and then we will work on the scale transformation.
# Overall review skew comparison.
print(f'''Weighted Average Reviews
------------------
original data skew: {beer_df_clean.weighted_review.skew()}
squared transformation: {np.square(beer_df_clean.weighted_review).skew()}
cubed transformation: {np.power(beer_df_clean.weighted_review, 3).skew()}
Overall Reviews
------------------
original data skew: {beer_df_clean.review_overall.skew()}
squared transformation: {np.square(beer_df_clean.review_overall).skew()}
cubed transformation: {np.power(beer_df_clean.review_overall, 3).skew()}
Aroma Reviews
------------------
original data skew: {beer_df_clean.review_aroma.skew()}
squared transformation: {np.square(beer_df_clean.review_aroma).skew()}
cubed transformation: {np.power(beer_df_clean.review_aroma, 3).skew()}
Appearance Reviews
------------------
original data skew: {beer_df_clean.review_appearance.skew()}
squared transformation: {np.square(beer_df_clean.review_appearance).skew()}
cubed transformation: {np.power(beer_df_clean.review_appearance, 3).skew()}
Palate Reviews
------------------
original data skew: {beer_df_clean.review_palate.skew()}
squared transformation: {np.square(beer_df_clean.review_palate).skew()}
cubed transformation: {np.power(beer_df_clean.review_palate, 3).skew()}
Taste Reviews
------------------
original data skew: {beer_df_clean.review_taste.skew()}
squared transformation: {np.square(beer_df_clean.review_taste).skew()}
cubed transformation: {np.power(beer_df_clean.review_taste, 3).skew()}
''')
Based on these skew values a cubed transformation looks like the best option across all review categories.
So let's just go ahead and create a cubed scale for the x-axis of the review distributions. Matplotlib doesn't have a 'cubed' option for `plt.xscale`, but we can use the 'function' option of `set_xscale` to apply our own function. Pretty cool.
Before we apply a cubed scale to the review distributions, I want to test the code out on some dummy data.
from matplotlib.ticker import FixedLocator
# define cubed function to apply to scale.
def forward(x):
return x**3
def inverse(x):
return np.sign(x) * (np.abs(x)) ** (1 / 3)
# plot dummy data using defined cubed function.
dummy_data = np.cbrt(np.arange(0,9))
fig, ax = plt.subplots(1,1, figsize=(6,5))
ax.plot(dummy_data, label=r'$\sqrt[3]{x}$')
ax.set_yscale('function', functions=(forward, inverse))
ax.yaxis.set_major_locator(FixedLocator(np.arange(0, 10, 1)**(1/3)))
ax.legend(fontsize=20);
# plotted y-values to help with interpretation.
np.cbrt(np.arange(0,9))
That worked nicely.
We purposefully plotted a cubed root function so that the resulting plot would be linear after applying the cubed function to the y-scale. It is much easier to interpret what is going on with this visualization.
If you take a look at the plot without the y-scale function applied, you can see the effect of the cubed function.
dummy_data = np.cbrt(np.arange(0,9))
fig, ax = plt.subplots(1,1, figsize=(6,5))
ax.plot(dummy_data);
Now, let's try to apply this function to the review distributions.
from matplotlib.ticker import FixedLocator
# define cubed function to apply to scale.
def forward(x):
return x**3
def inverse(x):
return np.sign(x) * (np.abs(x)) ** (1 / 3)
binsize = 5
bins = np.arange(0, 130, binsize)**(1/3)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))
sns.histplot(beer_df_clean, x='weighted_review', bins=bins, ax=ax1)
ax1.set_xscale('function', functions=(forward, inverse))
ax1.xaxis.set_major_locator(FixedLocator(np.arange(0, 130, 25)**(1/3)))
ax1.set_title('FixedLocator Ticks')
ax1.set_xlabel('Review score')
sns.histplot(beer_df_clean, x='weighted_review', bins=bins, ax=ax2)
ax2.set_xscale('function', functions=(forward, inverse))
tick_locs1 = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5]
plt.xticks(tick_locs1, tick_locs1)
ax2.set_title('True Value Ticks')
ax2.set_xlabel('Review score')
plt.suptitle('Weighted Average Review Score Distribution', fontsize=15);
Original distribution for comparison.
Right away, we can see that the cubed transformation does reduce the skew of the distribution.
I included two plots of the transformation for comparison:
- ax1: uses the matplotlib `FixedLocator` subclass from the `ticker` module to set tick marks. This evenly distributes the tick marks across the bins of the distribution. It is more aesthetically pleasing, however it can cause some difficulty in interpretation, since our review scores are in intervals of 0.5.
- ax2: uses `plt.xticks` to set tick marks that correspond to the true values of the review scores, i.e. intervals of 0.5. The spacing of the tick marks on this plot can also make it difficult to interpret, but we get the benefit of visualizing values that make sense to us.

Now we can apply the cubed transformation to the rest of the review categories.
fig, axes = plt.subplots(5,2, figsize=(12,25), sharey=True)
categories = ['review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']
for i, row in enumerate(axes):
binsize = 25
bins = np.arange(0, 130, binsize)**(1/3)
# FixedLocator ticks plot
sns.histplot(beer_df_clean, x=categories[i], bins=bins, ax=row[0])
row[0].set_xscale('function', functions=(forward, inverse))
row[0].xaxis.set_major_locator(FixedLocator(np.arange(0, 130, 25)**(1/3)))
row[0].set_xlabel('Review Score')
row[0].set_title('FixedLocator Ticks')
# True value ticks plot
sns.histplot(beer_df_clean, x=categories[i], bins=bins, ax=row[1])
row[1].set_xscale('function', functions=(forward, inverse))
tick_locs = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5]
row[1].set_xticks(tick_locs)
row[1].set_xticklabels(tick_locs)
row[1].set_xlabel('Review Score')
row[1].set_title('True Value Ticks')
# plt.subplots_adjust(hspace=3)
plt.figtext(0.38, 1, 'Overall Review Score Distribution', fontsize=15, weight=20)
plt.figtext(0.38, 0.8, 'Aroma Review Score Distribution', fontsize=15, weight=20)
plt.figtext(0.38, 0.6, 'Appearance Review Score Distribution', fontsize=15, weight=20)
plt.figtext(0.38, 0.4, 'Palate Review Score Distribution', fontsize=15, weight=20)
plt.figtext(0.38, 0.19, 'Taste Review Score Distribution', fontsize=15, weight=20)
plt.tight_layout(h_pad=5)
Alright, so now we have addressed the skew of each review distribution. Keep in mind that normalizing the distributions like this isn't always necessary for your analysis. However, I figured that it would be good practice, and it may be helpful later on.
Lastly, I want to take a look at a few qualitative variables in our dataset.
I want to take a look at the distributions of states where each beer is brewed, beer name, beer style, and brewery type.
# Let's see how many different states we are working with.
beer_df_clean.state.nunique()
Alright, we have all 50 states and D.C. represented in the dataset. It will be a tight squeeze, but I think it will be worth looking at the full distribution.
We also have to remember that each brewery may brew multiple beers, and each beer may have multiple reviews. Therefore each state would be counted multiple times. To deal with this, we can drop all brewery duplicates and then plot the states where those breweries are located.
state_unique = beer_df_clean.drop_duplicates(subset=['brewery_name', 'brewery_type', 'city', 'state'])
state_unique
state_unique.state.value_counts()
# still dealing with 51 states?
state_unique.state.nunique()
Now, we can go ahead and plot these values.
plt.figure(figsize=(12,14))
sns.countplot(data=state_unique, y='state', palette='Set3');
Let's now try to put these in descending order.
state_order = state_unique.state.value_counts().index
plt.figure(figsize=(12,14))
sns.countplot(data=state_unique, y='state', palette='Blues_r', order=state_order);
Last thing I want to do is to annotate the value counts for each state on the plot.
plt.figure(figsize=(12,14))
ax = sns.countplot(data=state_unique, y='state', palette='Blues_r', order=state_order)
plt.xlim(0,90)
plt.title('Most Popular Beer States')
sns.despine();
for p in ax.patches:
bar_end = p.get_width()
x = p.get_x() + p.get_width() + 0.3
y = p.get_y() + p.get_height() / 1.35
ax.annotate(format(bar_end, ',d'), (x,y))
Alright, this looks better, and it gives us an idea of where the majority of breweries are located in the U.S., which may be an indicator of the popularity of beer in each state.
We can see that California is a clear standout, with states like Oregon, New York, Washington, and Michigan also having a high number of breweries. It will be interesting to see which of these states also produce highly rated beers.
Now let's take a look at the most popular beer styles in the dataset.
With beer styles, we face the same problem of duplicates as we did with brewery states. Since each beer can have multiple reviews, beer styles may be counted multiple times for a single beer. Therefore, in order to get accurate counts for our plot, we will need to drop these duplicates.
style_unique = beer_df_clean.drop_duplicates(subset=['beer_name', 'beer_style', 'brewery_name'])
style_unique
Looks good, but it's possible that we are missing beer style counts where 2 or more breweries have the same name but are separate breweries, either because they are different brewery types or are located in different cities/states.
So we will include breweries in our subset columns just to cover these edge cases.
style_unique = beer_df_clean.drop_duplicates(subset=['beer_name', 'beer_style',
'brewery_name', 'brewery_type', 'city', 'state'])
style_unique
Ok, so it looks like our worries were justified since the number of rows returned increased by almost 1,000 to 11,180.
Now, let's go ahead and plot.
pd.Series(style_unique.beer_style.value_counts().values).describe()
Conveniently, there are 101 unique beer styles in `style_unique`. So we'll take a look at the upper quartile, which is roughly the top 25 beer styles.
# find beer styles in upper quartile.
top25_styles = style_unique.beer_style.value_counts().head(25).index
top25_styles
style_unique.beer_style.value_counts().head(25)
plt.figure(figsize=(9,10))
ax = sns.countplot(data=style_unique, y='beer_style', order=top25_styles, palette='Blues_r')
plt.xlim(0, 900)
plt.title('Top 25 Most Popular Styles of Beer')
sns.despine();
for p in ax.patches:
bar_end = p.get_width()
x = p.get_x() + p.get_width() + 5
y = p.get_y() + p.get_height() / 1.4
ax.annotate(format(bar_end, ',d'), (x,y))
Looks good.
Next, let's move onto types of breweries.
First I want to define each brewery type, as defined by the Brewers Association:
- Large: A brewery with an annual beer production greater than 6,000,000 barrels.
- Regional: A brewery with an annual beer production of between 15,000 and 6,000,000 barrels.
- Micro: A brewery that produces less than 15,000 barrels of beer per year and sells 75 percent or more of its beer off-site.
- Brewpub: A restaurant-brewery that sells 25 percent or more of its beer on-site and operates significant food services.
- Contract: A business that hires another brewery to produce its beer. It can also be a brewery that hires another brewery to produce additional beer.
- Proprietor: A licensed tenant brewery that physically takes possession of a shared brewery while brewing.
- Planning: Any independent or home brewery that is planning to expand their brewery into a brewpub, microbrewery, etc.
Next, let's filter for unique breweries so that we aren't recounting brewery types. To ensure that we don't run into a problem similar to the one with beer styles, where several beers have the same name and beer style combination, we will filter for rows with a unique combination of `brewery_name`, `brewery_type`, `city`, and `state`.

Filtering with these subset columns will make sure that we only count brewery types with a unique brewery name, in a unique city and state. This may be a bit overkill, however it will deal with certain edge cases, for example:
- Two breweries named 'X brewery', are both brewpubs, and are both located in New York.
- However, one of the breweries is in NYC, New York, and the other is in Albany, New York.
- In this case, we want this to be counted as 2 brewpubs. Having the four subset columns listed above will do this for us.
With that said, let's filter and plot.
brewery_unique = beer_df_clean.drop_duplicates(subset=['brewery_name', 'brewery_type', 'city', 'state'])
brewery_unique
brewery_unique.brewery_type.value_counts()
brewery_order = brewery_unique.brewery_type.value_counts().index
plt.figure(figsize=(9,8))
ax = sns.countplot(data=brewery_unique, y='brewery_type', order=brewery_order, palette='Blues_r')
plt.title('Most Popular Types of Breweries')
sns.despine()
for p in ax.patches:
    bar_end = p.get_width()
    x = p.get_x() + p.get_width() + 2
    y = p.get_y() + p.get_height() / 1.8
    ax.annotate(format(int(bar_end), ',d'), (x, y))
Alright, so we can see that a majority of breweries reviewed are microbreweries and brewpubs. This makes sense, since people who visit or drink beer from microbreweries and/or brewpubs are more likely to be beer enthusiasts who will have an opinion, i.e. a review, on the beers they drink.
Oh, I almost forgot that I wanted to look at the most popular beers based on number of reviews.
# plot top 25 most popular beers
beer_df_clean.beer_name.value_counts().head(25)
# plot order
top25_beers_order = beer_df_clean.beer_name.value_counts().head(25).index
plt.figure(figsize=(9,10))
ax = sns.countplot(data=beer_df_clean, y='beer_name', order=top25_beers_order, palette='Blues_r')
plt.xlim(0, 9000)
plt.title('Top 25 Most Popular Beers')
sns.despine();
for p in ax.patches:
    bar_end = p.get_width()
    x = p.get_x() + p.get_width() + 100
    y = p.get_y() + p.get_height() / 1.5
    ax.annotate(format(int(bar_end), ',d'), (x, y))
The first quantitative variable we looked at was `beer_abv`. There were some outliers present, which we defined using the 1.5×IQR rule. We dropped the outliers that lay below the lower bound, because they represented "Low Alcohol Beer", which isn't of interest in our analysis. No outliers were dropped above the upper bound, since we still needed their review data. In addition to outliers, the `beer_abv` distribution had a slight right skew, which was corrected with a log transformation.
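The 1.5×IQR rule works like this; a minimal sketch on toy ABV values (hypothetical numbers, not the actual dataset):

```python
import pandas as pd

# Toy ABV values; 0.5 stands in for a "Low Alcohol Beer" outlier.
abv = pd.Series([0.5, 4.2, 5.0, 5.5, 6.0, 7.0, 8.5, 12.0])

q1, q3 = abv.quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Mirror the analysis above: drop only the low outliers, keep
# high-ABV beers because their review data is still useful.
abv_clean = abv[abv >= lower_bound]
print(lower_bound, upper_bound, len(abv_clean))
```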
The variables `review_overall`, `review_aroma`, `review_appearance`, `review_palate`, and `review_taste` all had left skews, which were corrected with a cube transformation.
With the qualitative data, `state`, `beer_style`, `brewery_type`, and `beer_name`, we made count plots ranked from most frequent to least frequent value. For `beer_style` and `beer_name`, we only plotted the top 25 values, since these variables each had thousands of unique values.
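The two transformations can be sketched as follows, on hypothetical toy series (the notebook applied them to `beer_abv` and the five review columns):

```python
import numpy as np
import pandas as pd

right_skewed = pd.Series([4.0, 4.5, 5.0, 5.5, 6.0, 9.0, 14.0])  # long right tail, like ABV
left_skewed = pd.Series([1.0, 3.0, 3.5, 4.0, 4.2, 4.5, 4.8])    # long left tail, like reviews

log_transformed = np.log(right_skewed)  # compresses the right tail
cubed_transformed = left_skewed ** 3    # stretches the upper values

# Both transformations pull the skewness toward zero.
print(right_skewed.skew(), log_transformed.skew())
print(left_skewed.skew(), cubed_transformed.skew())
```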
Of the features investigated, the only operation that was necessary was the creation of temp tables (`state_unique`, `style_unique`, `brewery_unique`) used for the `state`, `beer_style`, and `brewery_type` variables respectively. These temp tables were needed in order to prevent overcounting the frequency of each value, since our dataset has multiple reviews for each unique beer. A quick example: Beer A is an American IPA and has 30 different reviews, and therefore represents 30 rows in the dataset. If we want to count the frequency of each beer style in the dataset, then we want to count 1 American IPA for Beer A instead of 30.
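The Beer A example can be demonstrated directly with a toy dataframe (hypothetical rows, same pattern as the temp tables):

```python
import pandas as pd

# Beer A (American IPA) has 3 reviews; Beer B (Stout) has 1.
reviews = pd.DataFrame({
    'beer_name':  ['Beer A', 'Beer A', 'Beer A', 'Beer B'],
    'beer_style': ['American IPA', 'American IPA', 'American IPA', 'Stout'],
})

# Counting raw rows overcounts American IPA 3x...
raw_counts = reviews.beer_style.value_counts()
# ...while deduplicating first counts each unique beer once.
style_counts = (reviews
                .drop_duplicates(subset=['beer_name', 'beer_style'])
                .beer_style.value_counts())

print(raw_counts['American IPA'], style_counts['American IPA'])  # 3 1
```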
Let's start the bivariate exploration with looking at quantitative correlations.
variables = ['beer_abv', 'weighted_review', 'review_overall', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']
corr_matrix = beer_df_clean[variables].corr()
corr_matrix.style.background_gradient(cmap='Blues').set_caption('Correlation coefficients of quantitative variables.').format('{:0.2f}')
This simple pandas correlation matrix serves its purpose. I also didn't know that pandas had a styling method, so I really wanted to try that out too.
Included in this matrix is `weighted_review`, which obviously has higher correlation coefficients than the other review categories, since it is a function of those categories. Therefore, in terms of correlations, `review_overall` is the primary variable of interest. Looking at the correlation coefficients of variables in relation to `review_overall` will give us some insight into which categories influence a reviewer the most when giving their overall impression, or `review_overall` score, of a beer.
Since we are primarily interested in the correlation coefficients of `review_overall`, I think it would be useful to plot these values on their own.
data_corr = corr_matrix['review_overall'].drop(['review_overall', 'weighted_review']).reset_index().sort_values('review_overall', ascending=False)
plt.figure(figsize=(8,6))
ax = sns.barplot(data = data_corr, x='review_overall', y='index', palette='colorblind')
plt.title('Correlation with Overall Review', pad=15)
plt.ylabel('')
plt.xlabel('Correlation Coefficient', labelpad=10)
plt.xlim(0,0.85)
sns.despine();
for p in ax.patches:
    bar_end = p.get_width()
    x = p.get_x() + p.get_width() + 0.01
    y = p.get_y() + p.get_height() / 1.8
    ax.annotate(format(bar_end, '0.2f'), (x, y))
Here we can see that `review_taste` has the highest correlation with overall review score, and `beer_abv` has the lowest. This suggests that a beer's taste is the primary factor in determining a reviewer's overall impression of a beer. This makes sense.
Next I want to take a look at some relationships between `weighted_review` and other qualitative variables, such as `beer_style`, `beer_name`, `brewery_name`, `state`, and `brewery_type`. One important note: any beer rated on BeerAdvocate needs at least 10 reviews to be ranked on the website, so we will keep that same threshold for this analysis.
# Are there any beer styles that have less than 10 reviews?
x = beer_df_clean.groupby('beer_style').count().weighted_review
x[x<10]
Japanese Rice Lager and Gueuze are the only beer styles that have fewer than 10 reviews. We'll keep this in mind, and deal with them if either one has one of the higher average ratings.
Let's start plotting `weighted_review` vs. `beer_style`.
# top 25 rated beer styles
top25_beer_styles = beer_df_clean.groupby('beer_style').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'], ascending=False).reset_index().head(25)
top25_beer_styles
I'm also interested in displaying the number of reviews with each average rating, similar to a standard review site, which is why I included a count column. However, I will save that until later for explanatory visualizations.
So Gueuze did in fact make it into the top 25, so we'll have to deal with this quickly.
top25_beer_styles = beer_df_clean.groupby('beer_style').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'], ascending=False).reset_index()
top25_beer_styles = top25_beer_styles[top25_beer_styles['count']>9].head(25)
top25_beer_styles
There we go.
Let's get a dataframe to plot from.
beer_style_filter = top25_beer_styles.beer_style.values
top25_beerstyles_df = beer_df_clean.query("beer_style in @beer_style_filter")
top25_beerstyles_df
# plot top 25 beer styles
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
sns.pointplot(data=top25_beerstyles_df, x='weighted_review', y='beer_style', order=beer_style_filter)
plt.grid(True)
plt.title('Top 25 Beer Styles by Average Review Score', fontsize=14)
plt.ylabel('Beer Styles')
plt.xlabel('Average Review Score')
plt.yticks(weight=600);
I decided to use a pointplot since we aren't dealing with any 0 values and it's less noisy than a violin or box plot.
The pointplot is also nice because it gives us an indication of the number of reviews a certain style has, through the confidence interval of each point. Those with larger error bars will tend to have a lower number of reviews, and vice versa.
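Under the hood, seaborn's pointplot draws these error bars as bootstrap 95% confidence intervals of the mean. A rough hand-rolled equivalent, on a hypothetical set of review scores, looks like this:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = np.array([3.5, 4.0, 4.5, 4.0, 4.8, 3.8, 4.2])  # hypothetical reviews for one style

# Resample with replacement 1,000 times and take the middle 95%
# of the resampled means.
boot_means = [rng.choice(scores, size=scores.size, replace=True).mean()
              for _ in range(1000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# With fewer reviews the resampled means vary more, so the
# interval (error bar) gets wider.
print(round(ci_low, 2), round(ci_high, 2))
```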
Next let's take a look at `weighted_review` vs. `beer_name`.
# first filter out beers with <10 reviews.
beer_filter = beer_df_clean.groupby('beer_name').count().weighted_review
beer_filter1 = beer_filter[beer_filter<10].index
top25_beers_df = beer_df_clean.query("beer_name not in @beer_filter1")
top25_beers_df
# Top 25 rated beers
top25_beers = top25_beers_df.groupby('beer_name').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'], ascending=False).reset_index().head(25)
top25_beers
# plot top 25 beers
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
sns.pointplot(data=top25_beers_df, x='weighted_review', y='beer_name', order=top25_beers.beer_name.values, capsize=0.4)
plt.grid(True)
plt.title('Top 25 Beers by Average Review Score', fontsize=14)
plt.ylabel('Beer Names')
plt.xlabel('Average Review Score')
plt.yticks(weight=600);
Again, the confidence intervals in this plot are very useful. For example, the beer "Three Sheets Barley Wine" is ranked 3rd with a rating of 4.6, but its error bars are fairly large. This tells us that we aren't that confident in the mean we have calculated, which is due to the fact that this beer only has 12 reviews.
`weighted_review` vs. `brewery_type`
# Any brewery types with <10 reviews?
top_brew_type = beer_df_clean.groupby('brewery_type').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'],ascending=False)
top_brew_type
Nope. So let's go straight ahead and plot.
plt.figure(figsize=(10,12))
sns.set_theme(style='darkgrid')
sns.pointplot(data=beer_df_clean, x='weighted_review', y='brewery_type', order=top_brew_type.index)
plt.grid(True)
plt.title('Brewery Types by Average Review Score', fontsize=14)
plt.ylabel('Brewery Type')
plt.xlabel('Average Review Score')
plt.yticks(weight=600);
For brewery types, the only type with large error bars is "proprietor", with only 117 reviews of beers from proprietor breweries.
It is interesting how "planning" breweries, which are essentially small home breweries, have the highest ratings. Since these are small batch breweries, it could be the case that more time and attention goes into each batch, which translates to higher reviews.
On the other end, "large" breweries have the lowest average ratings. It's possible that the opposite phenomenon is happening in large breweries, compared to "planning" breweries.
`weighted_review` vs. `brewery_name`
# filter and create dataframe.
brew_filter = beer_df_clean.groupby('brewery_name').count().weighted_review
brew_filter1 = brew_filter[brew_filter<10].index
top25_brew_df = beer_df_clean.query("brewery_name not in @brew_filter1")
top25_brew_df
top25_brew = top25_brew_df.groupby('brewery_name').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'],ascending=False).head(25)
top25_brew
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
sns.pointplot(data=top25_brew_df, x='weighted_review', y='brewery_name', order=top25_brew.index, capsize=0.4)
plt.grid(True)
plt.title('Top 25 Breweries by Average Review Score', fontsize=14)
plt.ylabel('Brewery Names')
plt.xlabel('Average Review Score')
plt.yticks(weight=600);
`weighted_review` vs. `state`
# any states with <10 reviews?
top_states = beer_df_clean.groupby('state').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'],ascending=False).reset_index()
top_states['mean'] = top_states['mean'].round(2)
top_states
North Dakota is the only state with less than 10 reviews, but we won't drop it so that we can at least compare all 50 states.
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
sns.pointplot(data=beer_df_clean, x='weighted_review', y='state', order=top_states.state)
plt.grid(True)
plt.title('States by Average Review Score', fontsize=14)
plt.ylabel('States')
plt.xlabel('Average Review Score')
plt.yticks(weight=600, fontsize=12);
Oklahoma and Missouri seem to have significantly lower review ratings than the rest of the states. I wonder if any "large" breweries are located in Oklahoma or Missouri, since those have the lowest review ratings?
beer_df_clean.query("state == 'Missouri'").brewery_type.value_counts()
beer_df_clean.query("state == 'Missouri'").brewery_name.value_counts()
So it looks like there are about 15,000 reviews from the "large" brewery Anheuser-Busch, which is most likely dragging down the average review score for beers brewed in Missouri, since "large" breweries also have the lowest average rating.
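The dragging-down effect is just the arithmetic of a weighted average; here is a toy sketch with made-up numbers (not the actual Missouri figures):

```python
import pandas as pd

# A few high scores from a micro brewery, swamped by many
# low scores from a single large brewery.
mo = pd.DataFrame({
    'brewery_type': ['micro'] * 3 + ['large'] * 15,
    'weighted_review': [4.5, 4.2, 4.4] + [3.0] * 15,
})

state_mean = mo.weighted_review.mean()
mean_without_large = mo.query("brewery_type != 'large'").weighted_review.mean()
print(round(state_mean, 2), round(mean_without_large, 2))  # 3.23 4.37
```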
The last thing I want to do with this state data is create some sort of geospatial plot that might show some regional trends in average ratings. Let's give that a shot.
# First we need to add a state 'code' column to properly map the states in the map.
state_codes = {
"Alabama":"AL",
"Alaska":"AK",
"Arizona":"AZ",
"Arkansas":"AR",
"California":"CA",
"Colorado":"CO",
"Connecticut":"CT",
"Delaware":"DE",
"District of Columbia":"DC",
"Florida":"FL",
"Georgia":"GA",
"Hawaii":"HI",
"Idaho":"ID",
"Illinois":"IL",
"Indiana":"IN",
"Iowa":"IA",
"Kansas":"KS",
"Kentucky":"KY",
"Louisiana":"LA",
"Maine":"ME",
"Montana":"MT",
"Nebraska":"NE",
"Nevada":"NV",
"New Hampshire":"NH",
"New Jersey":"NJ",
"New Mexico":"NM",
"New York":"NY",
"North Carolina":"NC",
"North Dakota":"ND",
"Ohio":"OH",
"Oklahoma":"OK",
"Oregon":"OR",
"Maryland":"MD",
"Massachusetts":"MA",
"Michigan":"MI",
"Minnesota":"MN",
"Mississippi":"MS",
"Missouri":"MO",
"Pennsylvania":"PA",
"Rhode Island":"RI",
"South Carolina":"SC",
"South Dakota":"SD",
"Tennessee":"TN",
"Texas":"TX",
"Utah":"UT",
"Vermont":"VT",
"Virginia":"VA",
"Washington":"WA",
"West Virginia":"WV",
"Wisconsin":"WI",
"Wyoming":"WY"
}
top_states['state_code'] = top_states.state.replace(to_replace=state_codes)
top_states
import plotly.graph_objects as go
fig = go.Figure(data = go.Choropleth(
locations = top_states['state_code'],
z = top_states['mean'],
locationmode = 'USA-states',
text = top_states['state']+'<br>'+'# reviews: '+ top_states['count'].astype(str),
colorscale = 'Blues',
colorbar_title = 'Average<br>Review<br>Score',
colorbar = {'ticks':'outside'}
))
fig.update_layout(
title_text = 'Average Beer Rating by State<br>(Hover for more info)',
geo = dict(
scope = 'usa',
projection = go.layout.geo.Projection(type = 'albers usa'),
showlakes=True,
lakecolor='rgb(255, 255, 255)')
)
So interestingly enough this heatmap of the US does give us some additional insight.
For example, it looks like the coasts brew higher-rated beer than the middle of the country, except for Kansas and Nebraska, which have the 1st and 9th highest average reviews respectively. Kansas tops the list with an average rating of 4.11, however it only has 132 reviews, which is relatively low.
In addition, it looks like the states surrounding the Great Lakes have high ratings as well. This is interesting; I wonder if it is because they have easy access to fresh water?
Next, I want to try and make a similar geospatial map with cities, and then figure out how to overlay the two maps into one visualization.
# first we want to filter out cities with less than 10 reviews.
city_filter = beer_df_clean.city.value_counts()
city_filter = city_filter[city_filter>=10].index
top_cities_df = beer_df_clean.query('city in @city_filter')
top_cities_df
# create dataframe of cities ranked by average rating.
top_cities = top_cities_df.groupby('city').weighted_review.agg(['count', 'mean'])\
.sort_values(by=['mean'],ascending=False).reset_index()
top_cities['mean'] = top_cities['mean'].round(2)
top_cities
I am going to skip the pointplot for cities and go straight to the map.
First we need to append longitudes and latitudes to the `top_cities` dataframe as location references for the map.
top_cities = top_cities.merge(top_cities_df.groupby('city').first(), on='city')[['city', 'count', 'mean', 'longitude', 'latitude']]
top_cities
# drop any cities without location data.
top_cities.dropna(inplace=True)
top_cities
# round longitude and latitude values so that they aren't cumbersome when they are shown on the map.
top_cities[['longitude', 'latitude']] = top_cities[['longitude', 'latitude']].round(2)
top_cities
Before we plot, we're going to need a dummy row, since there are no cities with average scores less than 2. What this means is that the legend on the map won't show a marker for data points between 1-2. Rather it would only show markers for, 2-3, 3-4, and 4-5, which is incomplete.
Having a dummy row with NaN values will give us a data point we can use to populate a marker for review scores 1-2, even though there are no cities with an average score below 2.
top_cities.reset_index(drop=True, inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
top_cities = pd.concat([top_cities, pd.DataFrame([{'city': 'Dummy row', 'count': np.nan, 'mean': np.nan, 'longitude': np.nan, 'latitude': np.nan}])], ignore_index=True)
top_cities
Now we have a clean dataframe to use to create the map.
limits = [1,2,3,4]
colors = ['lightcyan', 'rgb(59,59,59)', 'cornflowerblue', 'rgb(241,105,19)']
fig = go.Figure()
# trace for dummy data, used to populate a 1-2 marker in the legend.
df_1_2 = top_cities.query("city == 'Dummy row'")
fig.add_trace(go.Scattergeo(
name = '1 - 2',
visible = 'legendonly',
showlegend = True,
lon = df_1_2['longitude'],
lat = df_1_2['latitude'],
marker = dict(
color = 'lightgrey',
line_color = 'rgb(40,40,40)',
line_width = 0.5,
size = 8)))
# city review data
for i in range(len(limits)):
    df_sub = top_cities[top_cities['mean'].between(i+1, i+2)]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = df_sub['city']+'<br>'+'Review score: '+df_sub['mean'].astype(str)+'<br>'+'# reviews: '+df_sub['count'].astype(str),
        name = f'{i+1} - {i+2}',
        marker = dict(
            color = colors[i],
            opacity = 0.8,
            size = 8,
            line_color = 'rgb(40,40,40)',
            line_width = 0.5)))
fig.update_layout(
title = dict(
text = 'Top Rated Cities for Drinking Beer<br>(Click on legend to toggle ratings)',
xanchor = 'center',
x = 0.5),
showlegend = True,
legend = dict(
title = 'Review Score',
x = 0.92,
y = 0.52),
geo = dict(
scope = 'usa',
landcolor = 'rgb(217,217,217)'))
Similar to the state map, many of the highly rated cities are located on the coasts and around the Great Lakes.
Now that we have two independent maps for states and cities, I want to try and merge them into one comprehensive map.
Individually, using color encodings works well; however, if I want to combine them, then I can't use color encodings for both. I think it would be best to keep the color encoding for the state portion of the map, and change the encoding for cities to color and shape.
colors = ['white', 'gold', 'crimson']
shapes = ['x', 'circle', 'star']
opacities = [1, 0.8, 1]
sizes = [9,8,10]
# state data
fig = go.Figure(go.Choropleth(
locations = top_states['state_code'],
z = top_states['mean'],
locationmode = 'USA-states',
text = top_states['state']+'<br>'+'# reviews: '+ top_states['count'].astype(str),
name = 'state',
colorscale = 'Blues',
colorbar_title = 'State<br>Review<br>Score',
colorbar = dict(
ticks = 'outside',
x = 0.95,
y = 0.55)))
# trace for dummy data, used to populate a 1-2 marker in the legend.
df_1_2 = top_cities.query("city == 'Dummy row'")
fig.add_trace(go.Scattergeo(
name = '1 - 2',
visible = 'legendonly',
showlegend = True,
lon = df_1_2['longitude'],
lat = df_1_2['latitude'],
marker = dict(
symbol = 'x-thin',
color = 'black',
line_color = 'rgb(40,40,40)',
line_width = 0.5,
size = 8)))
# city data
for i in range(len(colors)):
    df_sub = top_cities[top_cities['mean'].between(i+2, i+3)]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = df_sub['city']+'<br>'+'Review score: '+df_sub['mean'].astype(str)+'<br>'+'# reviews: '+df_sub['count'].astype(str),
        name = f'{i+2} - {i+3}',
        marker = dict(
            symbol = shapes[i],
            color = colors[i],
            size = sizes[i],
            opacity = opacities[i],
            line_color = 'black',
            line_width = 0.75)))
fig.update_layout(
title = dict(
text = 'Top Rated States & Cities for Drinking Beer<br>(Click on legend to toggle ratings)',
xanchor = 'center',
x = 0.5),
showlegend = True,
legend = dict(
title = dict(
text = 'City Review Score',
side = 'top'),
xanchor = 'center',
x = 0.5,
y = -0.03,
itemsizing = 'trace',
orientation = 'h'),
geo = dict(
scope = 'usa',
projection = go.layout.geo.Projection(type = 'albers usa'),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)',
landcolor = 'rgb(217,217,217)'))
Adding a shape encoding does seem to help a little bit when combining the two maps. Cities with the highest ratings, 4-5, stand out more with the star shape encoding.
By combining the two maps we can see a nice correlation between high rated states and the number of high rated cities in that state.
Now I want to move on to exploring a few qualitative variables against the quantitative variable `beer_abv`.
Let's take a look at `beer_abv` vs. `beer_style`.
First we will need to drop all beers that have multiple reviews, so that the `beer_abv` means aren't skewed by beers that have more reviews.
beer_unique = beer_df_clean.drop_duplicates(subset=['beer_name'])
beer_unique
top25_style_abv = beer_unique.groupby('beer_style').beer_abv.agg(['count', 'mean']).sort_values(by='mean', ascending=False)
top25_style_abv = top25_style_abv.query('count >= 10')
top25_style_abv = top25_style_abv.head(25)
top25_style_abv
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
sns.pointplot(data=beer_unique, x='beer_abv', y='beer_style', order=top25_style_abv.index)
plt.grid(True)
plt.title('Beer Style by Average ABV', fontsize=14)
plt.ylabel('Beer Style')
plt.xlabel('Average ABV')
plt.yticks(weight=600, fontsize=12);
The above plot shows the top 25 beer styles based on average ABV. The highest average ABV beer style is Eisbock, which has a fairly large error bar due to only having 11 unique Eisbock style beers in the dataset.
I'd never heard of a "wine" brew, like barleywine or wheatwine, but they hold 3 of the top 5 positions with relatively small error bars, so there should be an explanation for this.
A quick Google search tells us that these are in fact among the most intense and strongest beer styles brewed. Cool.
`beer_abv` vs. `brewery_type`
top_brewtype_abv = beer_unique.groupby('brewery_type').beer_abv.agg(['count', 'mean'])\
.sort_values(by='mean', ascending=False).head(25)
top_brewtype_abv
Alright so it looks like "proprietor" breweries have the highest average ABV. However, there are only 7 values making up this mean. So let's take a closer look at these values and see if it is worth keeping "proprietor" breweries.
beer_unique.query("brewery_type == 'proprietor'")
So it doesn't seem like any specific beers are skewing the mean. However, since we have been holding to a minimum of 10 reviews/values to be considered significant throughout this analysis, let's stay consistent and drop "proprietor" breweries.
top_brewtype_abv = top_brewtype_abv.query("count >= 10")
top_brewtype_abv
plt.figure(figsize=(10,12))
sns.set_theme(style='darkgrid')
sns.pointplot(data=beer_unique, x='beer_abv', y='brewery_type', order=top_brewtype_abv.index)
plt.grid(True)
plt.title('Brewery Types by Average ABV', fontsize=14)
plt.ylabel('Brewery Type')
plt.xlabel('Average ABV')
plt.yticks(weight=600);
It's interesting to see "large" breweries at the bottom of average ABV. It makes sense though, since large breweries are brewing for profit at scale, and making lower ABV beer is cheaper and faster than higher ABV beer.
In this section we focused on a few quantitative variables: `review_overall`, `weighted_review`, and `beer_abv`. We compared these variables to a host of different qualitative variables, such as `beer_name`, `brewery_type`, `state`, etc. Some interesting findings with these variables:
- `review_taste` has the highest correlation with `review_overall`, while `beer_abv` has the lowest.
- Stout beers outperform in terms of average review score, for both styles and specific beers.
- 'Planning' breweries, such as home breweries, have the highest average review scores, while 'large' breweries have the lowest.
- Based on the geospatial maps, we found that states located on the coasts and around the Great Lakes tend to have higher average review scores. We found the same trend with cities.
- When it came to `beer_abv`, we found that wine-style beers, such as barleywine and wheatwine, have some of the highest average ABV values of any beer style, holding 3 of the top 5 spots in 'Beer Style by Average ABV'.
- We also found that 'large' breweries have the lowest average ABV values. This makes sense, especially in tandem with having the lowest average review scores, since large breweries optimize for mass production. Part of that optimization leads large breweries to sacrifice quality and produce low-ABV beers, which in general are faster and cheaper to brew.
For this section, I want to start by looking at the correlations between `review_overall`, which is the reviewer's overall impression of the beer, and the other main variables of interest, `beer_abv`, `review_aroma`, `review_appearance`, `review_palate`, and `review_taste`, with respect to the categorical variable `brewery_type`.
We already found these correlations at the beginning of the 'Bivariate Exploration' section, without respect to `brewery_type`. The goal here is to see if certain brewery types have any significant effect on the correlations. The practical question being answered here is: does a certain brewery type have higher predictive value when it comes to `review_overall` scores?
# loops through each brewery type and corr. variable to find a corr. coeff. for each combination w/ respect to `review_overall`.
brewery_types = beer_df_clean.brewery_type.unique()
variables = ['beer_abv', 'review_aroma', 'review_appearance', 'review_palate', 'review_taste']
corr_list = []
for i in brewery_types:
    df = beer_df_clean[beer_df_clean['brewery_type'] == i]
    for var in variables:
        corr = df['review_overall'].corr(df[var])
        corr_dict = {'brewery_type': i, 'corr_variable': var, 'corr': corr}
        corr_list.append(corr_dict)
corr_list
Above we created a list of dicts, where each dict contains a correlation coefficient, for each correlation variable, for each brewery type.
We are essentially running the same code we did when we found the correlation coefficients of these variables at the beginning of the 'Bivariate Exploration' section, except here we're finding them for each brewery type instead of the entire dataframe `beer_df_clean`. Now from this list of dicts we can easily create a dataframe to plot with seaborn.
corr_df = pd.DataFrame(corr_list)
corr_df
plt.figure(figsize=(15,10))
sns.set_style('darkgrid')
sns.pointplot(data=corr_df, x='corr_variable', y='corr', hue='brewery_type',
hue_order = top_brew_type.index, linestyles=('dotted'))
plt.title('Correlation with Overall Review by Brewery Type', weight='bold', fontsize=20)
plt.ylabel('Correlation Coefficient', labelpad=20, fontsize=15)
plt.xlabel('Correlation Variable', labelpad=20, fontsize=15)
plt.legend(title='Brewery Type');
# brewery type value counts for reference to the above plot.
beer_df_clean.brewery_type.value_counts()
So we actually find something interesting here. Most of the brewery types tend to have relatively similar correlation coefficients, except for 'proprietor' and 'large' breweries.
For 'proprietor' breweries, the high variance in correlation coefficients could be due to the fact that there are only 117 data points. This would need some more investigation to confirm, which we won't get into in this analysis.
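One way that investigation could go is to bootstrap the correlation coefficient and see how wide its interval is at n = 117; a sketch on synthetic data (hypothetical, not the actual proprietor subset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 117  # same sample size as the proprietor group

# Synthetic correlated pair standing in for, e.g., review_taste
# and review_overall.
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.7, size=n)

boot_corrs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    boot_corrs.append(np.corrcoef(x[idx], y[idx])[0, 1])

ci_low, ci_high = np.percentile(boot_corrs, [2.5, 97.5])
# A wide interval would confirm the plotted coefficient is unstable.
print(round(ci_low, 2), round(ci_high, 2))
```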
'Large' breweries don't have this potential issue, with almost 70,000 data points. If we ignore 'proprietor' breweries, 'large' breweries have the highest correlations across all variables. What this means is that `beer_abv` and the other review variables have higher predictive value for 'large' brewery beers, in terms of `review_overall` scores.
Next, I want to move away from correlations and focus on more practical insights and questions for everyday beer drinkers and enthusiasts.
Therefore, I want to start by looking into the following question: what high-ABV beers have the highest ratings? In other words, which highly rated beers have the potential to intoxicate you quickly? With this information, your goal could be to get drunk fast, or to be informed and cautious when drinking high-ABV beers.
I will be defining a high-ABV beer as any beer above the upper quartile of `beer_abv`. After analyzing the upper quartile, I will answer the same question for values below the lower quartile, and for values in the IQR.
beer_unique.beer_abv.describe()
We first need to find what ABV values are above the upper quartile.
upperqrt_abv = beer_df_clean.query('beer_abv >= 8')
upperqrt_abv
top25_upper_abv = upperqrt_abv.groupby('beer_name').weighted_review.agg(['count', 'mean'])
top25_upper_abv = top25_upper_abv.query('count >= 10').sort_values('mean', ascending=False).reset_index().head(25)
top25_upper_abv = top25_upper_abv.merge(beer_unique.query('beer_name in @top25_upper_abv.beer_name'), on='beer_name')\
[['beer_name', 'count', 'mean', 'beer_abv']]
top25_upper_abv
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
ax = sns.pointplot(data=upperqrt_abv, x='weighted_review', y='beer_name', order=top25_upper_abv.beer_name)
plt.grid(True)
plt.title('Top Rated High ABV Beers' , fontsize=14)
plt.ylabel('Beer Name')
plt.xlabel('Average Rating')
plt.yticks(weight=600, fontsize=12);
We know that all of these beers are in the upper quartile, but there's a large range of values in the upper quartile. I think it would be useful to add each of these beer's ABV value on the plot for quick reference and comparison.
# create list of new tick labels with ABV values
upper_abv = top25_upper_abv.beer_abv.tolist()
labels = [item.get_text() for item in ax.get_yticklabels()]
labels_abv = []
for i, label in enumerate(labels):
    new_label = f'{label} [{upper_abv[i]} ABV]'
    labels_abv.append(new_label)
labels_abv
# replot
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
ax = sns.pointplot(data=upperqrt_abv, x='weighted_review', y='beer_name', order=top25_upper_abv.beer_name)
plt.grid(True)
plt.title('Top Rated High ABV Beers' , fontsize=14)
plt.ylabel('Beer Name')
plt.xlabel('Average Rating')
ax.set_yticklabels(labels_abv, fontdict={'fontsize':12, 'fontweight':600});
That looks better and a bit more informative in my opinion.
This plot shows us the top 25 rated beers in the upper quartile of ABV. This would be useful for anyone who is interested in drinking beers with above average ABV, while still taking into consideration other variables such as taste, feel, smell, and look of their beer.
Next, I want to do the same analysis with the lower quartile of `beer_abv`. This will attempt to answer the question: what low-ABV beers have the highest ratings? In other words, what good beer can I drink for an extended amount of time, for example over the course of a day, without getting too intoxicated?
# What ABV value defines the lower quartile?
beer_unique.beer_abv.describe()
lowerqrt_abv = beer_df_clean.query('beer_abv <= 5.2')
lowerqrt_abv
top25_lower_abv = lowerqrt_abv.groupby('beer_name').weighted_review.agg(['count', 'mean'])
top25_lower_abv = top25_lower_abv.query('count >= 10').sort_values('mean', ascending=False).reset_index().head(25)
top25_lower_abv = top25_lower_abv.merge(beer_unique.query('beer_name in @top25_lower_abv.beer_name'), on='beer_name')\
[['beer_name', 'count', 'mean', 'beer_abv']]
top25_lower_abv
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
ax1 = sns.pointplot(data=lowerqrt_abv, x='weighted_review', y='beer_name', order=top25_lower_abv.beer_name)
plt.grid(True)
plt.title('Top Rated Low ABV Beers' , fontsize=14)
plt.ylabel('Beer Name')
plt.xlabel('Average Rating')
lower_abv = top25_lower_abv.beer_abv.tolist()
labels_lower = [item.get_text() for item in ax1.get_yticklabels()]
labels_low_abv = [f'{label} [{lower_abv[i]} ABV]' for i, label in enumerate(labels_lower)]
ax1.set_yticklabels(labels_low_abv, fontdict={'fontsize':12, 'fontweight':600});
One interesting thing to note between the lower and upper quartile ABV plots is that there seems to be a "sweet spot" for high ratings in both quartiles. For the lower quartile, beers with ABV between 4 and 5.2 dominate the list, apart from "Southampton Berliner Weisse", which ranks second. For the upper quartile, the same is true for beers between 8 and 12 ABV.
Finally, let's take a look at beers in the interquartile range, i.e. between the lower and upper quartiles. Maybe someone just wants to know the highest rated 'medium' strength beers in terms of ABV. So let's go ahead and do that.
# What ABV values define the interquartile range?
beer_unique.beer_abv.describe()
medianqrt_abv = beer_df_clean.query('5.2 < beer_abv < 8.0')
medianqrt_abv
medianqrt_abv.beer_abv.describe()
top25_median_abv = medianqrt_abv.groupby('beer_name').weighted_review.agg(['count', 'mean'])
top25_median_abv = top25_median_abv.query('count >= 10').sort_values('mean', ascending=False).reset_index().head(25)
top25_median_abv = top25_median_abv.merge(beer_unique.query('beer_name in @top25_median_abv.beer_name'), on='beer_name')\
[['beer_name', 'count', 'mean', 'beer_abv']]
top25_median_abv['beer_abv'] = top25_median_abv.beer_abv.round(1)
top25_median_abv
Hoppy Birthday isn't rounding the way we might expect. It has an ABV value of 5.25, but rounds to 5.2 because np.round (like Python's built-in round) rounds ties to the nearest even digit rather than always rounding up. No other value has this problem, so let's just fix Hoppy Birthday.
# confirm ABV value of Hoppy Birthday.
beer_df_clean.query("beer_name == 'Hoppy Birthday'").beer_abv.head(1)
top25_median_abv.at[0, 'beer_abv'] = 5.3
top25_median_abv
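For context, 5.25 is exactly representable in binary (21/4), so this isn't a storage error: it is the tie-breaking rule. Both Python's round and np.round use round-half-to-even (banker's rounding), which sends 5.25 down to 5.2 because 2 is the even digit. A quick check:

```python
import numpy as np

# 5.25 is an exact binary value, so the tie-breaking rule decides:
# round-half-to-even picks 5.2 (even final digit) over 5.3.
print(round(5.25, 1))     # 5.2
print(np.round(5.25, 1))  # 5.2
# 5.26 is not a tie, so it rounds up as expected.
print(np.round(5.26, 1))  # 5.3
```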
# plot
plt.figure(figsize=(12,15))
sns.set_theme(style='darkgrid')
ax2 = sns.pointplot(data=medianqrt_abv, x='weighted_review', y='beer_name', order=top25_median_abv.beer_name)
plt.grid(True)
plt.title('Top Rated Medium ABV Beers' , fontsize=14)
plt.ylabel('Beer Name')
plt.xlabel('Average Rating')
median_abv = top25_median_abv.beer_abv.tolist()
labels_median = [item.get_text() for item in ax2.get_yticklabels()]
labels_med_abv = [f'{label} [{median_abv[i]} ABV]' for i, label in enumerate(labels_median)]
ax2.set_yticklabels(labels_med_abv, fontdict={'fontsize':12, 'fontweight':600});
Sweet, looks good. This plot represents the top rated beers with mid-range ABV values.
The relationship looked at during this section was the correlation between review_overall and the variables beer_abv, review_aroma, review_appearance, review_palate, and review_taste, with respect to brewery_type. We found that 'large' breweries have the highest correlations across all variables, aside from 'proprietor' breweries, which had only 117 data points and therefore a potentially large margin of error.
Next, we took a look at a few multivariate rankings instead of correlations. These variables were beer_name, weighted_review, and beer_abv, where each beer was grouped into its respective quartile in terms of beer_abv. The aim here was to answer practical questions a beer drinker might have about the best beers to drink in different situations. For example, someone might ask: what is the best/highest rated beer to drink over the course of the day? To answer this question, you would want to look at the highest rated beers in the lower quartile of beer_abv. In other words, you would want a beer with a high rating and a relatively low ABV content, so that you could drink it over the course of a day.
These were the top rated beers for each beer_abv quartile:
- Lower Quartile: Carnie Fire at 5.0 ABV
- Interquartile Range: Hoppy Birthday at 5.3 ABV
- Upper Quartile: Rare Bourbon County Stout at 13.0 ABV
One interesting thing found in the rankings of lower quartile beers was that the 2nd highest rated beer, 'Southampton Berliner Weisse', has a 2.0 ABV content while having 4x the number of reviews as Carnie Fire (ranked 1st).
When it came to the upper quartile of beer_abv, the top 3 ranked beers were all stouts. In addition, 16 of the top 25 beers in the upper quartile were either stouts or stout variations. In the 'Top Rated High ABV Beers' plot, 12 beers have 'stout' in their name, while 4 do not: Russian Roulette, Sea Monster, Darkness - Bourbon Barrel Aged, and The Abyss. With well over half of the highest rated, high ABV beers being stouts, recommending a stout to someone looking for high ABV beers would be a safe bet.